Character Classes
The ability to match a class, or collection, of characters at a specific point in a target string permits patterns that can match a range of text.
A class of characters can be matched at a given position in one of three ways:
the dot metacharacter;
character classes;
class shorthands.
dot -- the match-any-character class (.)
A class that matches any character except the null character '\0'. Since it matches almost any character, it is the most general of all possible character classes.
Example
regex.easyMatch ("c.t","catheter")
» true
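For readers without Frontier at hand, the same check can be sketched in Python. This is only a stand-in: the helper name easy_match is hypothetical, it assumes regex.easyMatch reports a match anywhere in the target string, and Python's dot excludes the newline by default rather than the null character.

```python
import re

def easy_match(pattern, text):
    """Hypothetical stand-in for regex.easyMatch: True if the pattern
    matches anywhere in the text (an assumption about its behavior)."""
    return re.search(pattern, text) is not None

dot_hit = easy_match("c.t", "catheter")   # "cat": the dot matches "a"
dot_any = easy_match("c.t", "c#t")        # the dot also matches punctuation
dot_miss = easy_match("c.t", "ct")        # the dot must consume one character

print(dot_hit, dot_any, dot_miss)
```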
Character classes ([...])
A character class, also known as a "list" or "bracket expression", is a list of one or more items. The list is defined by the items between the square brackets, "[...]". An item in a character class can be either an ordinary character, representing itself, or a metacharacter. However, metacharacters are defined differently inside a character class than outside one.
Example
"[abc]" matches either "a" or "b" or "c".
"Defen[sc]e" will match either "Defense" or "Defence".
If you want to include a "]" in a character list, either include it as the first character (e.g. "[]]"), or escape it using a backslash (e.g. "[\\]]").
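The behavior described above can be tried out in Python's re module, which follows the same two conventions for a literal "]". This is a sketch in a different engine, not Frontier output; the raw string r"..." takes the place of the doubled backslash.

```python
import re

# A class matches exactly one character drawn from its list.
abc_hits = re.findall("[abc]", "cabbage")
spellings = re.findall("Defen[sc]e", "Defense and Defence")

# Two ways to put a literal "]" in a class: list it first, or escape it.
bracket_first = re.findall("[]]", "a]b]")
bracket_escaped = re.findall(r"[\]]", "a]b]")

print(abc_hits, spellings, bracket_first, bracket_escaped)
```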
character-class metacharacters
Character classes have their own rules for what are and what aren't metacharacters. Something that is a metacharacter outside of a character class may not be a metacharacter inside a character class.
For example, the dot metacharacter is just a plain dot inside a character class.
- the dash
The dash indicates a range of characters. A range is formed by placing a dash between two characters. The range represented falls between the beginning and ending elements in the ASCII sequence.
Examples
- "[a-z]" is equivalent to "[abcdefghijklmnopqrstuvwxyz]"
- "[0-9]" is the same as "[0123456789]"
- "<H1>[a-zA-Z0-9 ]+</H1>" may match a level 1 heading in HTML code.
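The range examples above can be checked with a short Python sketch (Python's re module is used as a stand-in engine; the sample strings are made up for illustration). It also shows why the heading pattern only "may" match: any character outside the class breaks the match.

```python
import re

# A range is shorthand for listing every character between its endpoints.
lower_range = re.fullmatch("[a-z]+", "regex") is not None
lower_list = re.fullmatch("[abcdefghijklmnopqrstuvwxyz]+", "regex") is not None

digit_runs = re.findall("[0-9]+", "v2 build 1047")

heading_pattern = "<H1>[a-zA-Z0-9 ]+</H1>"
plain_heading = re.search(heading_pattern, "<H1>Chapter 1</H1>") is not None
# The apostrophe and "?" are not in the class, so this heading is missed.
punct_heading = re.search(heading_pattern, "<H1>What's new?</H1>") is not None

print(lower_range, lower_list, digit_runs, plain_heading, punct_heading)
```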
Cases when the dash is not a metacharacter inside a character class:
the dash is the first or last character in the list;
the dash is the last character in a range;
the dash is escaped with a backslash "\".
^ the caret
If the caret is the first element in the list, the character class matches any character that is not in the list. [^...] classes are known as negated character classes.
Examples
- "[^a-z]" matches any character that is not a lowercase letter.
- "<!--[^>]+-->" will match HTML comments that contain no ">" - "[^>]+" means match every character up to the next ">".
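Negated classes, together with the literal-dash cases listed earlier, can be sketched in Python as follows. Python's re module is a stand-in here, the sample strings are invented, and the comment pattern uses the standard "-->" terminator.

```python
import re

# A leading caret negates the class: everything except a-z is kept.
non_lower = re.findall("[^a-z]", "abC-9z")

# [^>]+ consumes characters up to, but not including, the next ">".
comment = re.search("<!--[^>]+-->", "a <!-- note --> b").group(0)

# A dash listed first or last, or escaped, is an ordinary character.
dash_first = re.findall("[-ab]", "a-b-c")
dash_last = re.findall("[ab-]", "a-b-c")
dash_escaped = re.findall(r"[a\-b]", "a-b-c")

print(non_lower, comment, dash_first, dash_last, dash_escaped)
```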
\ the escape
The escape allows character class metacharacters to be represented as themselves. When using an escape in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.
Example
"[,\\-\\]]" matches a comma, a dash, or a closing square bracket.
[:...:] POSIX bracket expressions
A POSIX* bracket expression** contains one of several special class shortcuts. These character shortcuts are only valid within character classes.
Examples
regex.easyMatch ("[[:alpha:]]", "Ë")
» true
regex.easyMatch ("[:alpha:]", "Ë")
» false - because it attempts to match the class ":", "a", "l", "p" and "h" against "Ë".
The supported POSIX character shortcuts are:
alnum    letters (including diacritical characters) and digits.
alpha    letters (including diacritical characters).
blank    a space or tab.
cntrl    control characters in the ASCII encoding (i.e. codes less than 32 and code 127).
digit    digits - 0123456789.
graph    same as "print" except omits space.
lower    lowercase letters - including diacritical characters.
print    printable characters (in the ASCII encoding, space through tilde - codes 32 through 126).
punct    neither control nor alphanumeric characters.
space    space, carriage return, newline, tab, and form feed.
upper    uppercase letters - including diacritical characters.
xdigit   hexadecimal digits: "0"-"9", "a"-"f", "A"-"F".
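Python's re module does not support the [:name:] shortcuts, which makes it a convenient place to see what they stand for. The sketch below uses a hypothetical helper, expand_posix, that rewrites a few of the names into explicit ASCII ranges before compiling. It is an approximation under a stated assumption: the ranges are ASCII-only, whereas the shortcuts described above also cover diacritical characters.

```python
import re

# ASCII-only stand-ins for some of the names listed above (assumption:
# the real shortcuts also include diacritical letters; these do not).
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
}

def expand_posix(pattern):
    """Hypothetical helper: replace each [:name:] with its ASCII ranges.
    An unlisted name raises KeyError."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

alpha_class = expand_posix("[[:alpha:]]")
mixed_class = expand_posix("[[:digit:][:upper:]_]")
hex_ok = re.fullmatch(expand_posix("0x[[:xdigit:]]+"), "0x1aF") is not None

# Without the outer brackets, "[:alpha:]" is just the class ":", "a", "l",
# "p", "h", so it matches "p" but not "\u00cb" (the letter Ë).
bare_misses = re.search("[:alpha:]", "\u00cb") is None
bare_hits_p = re.search("[:alpha:]", "p") is not None

print(alpha_class, mixed_class, hex_ok, bare_misses, bare_hits_p)
```

The last two checks mirror the pitfall shown in the examples: the shortcut only has its special meaning inside an enclosing character class.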
Class shorthands
Class shorthands are shortcuts for a character class.
When using an escape, "\", in a pattern, the backslash itself needs to be escaped to enable Frontier to pass the escape to the regex engine.
\d   Digit                Match any digit. It is equivalent to "[0-9]".
\D   Non-digit            Match any character that is not a digit. It is equivalent to "[^0-9]".
\s   Whitespace           Match any whitespace character - horizontal tab, line feed, vertical tab, form feed, carriage return and space.
\S   Non-whitespace       Match any character that is not whitespace.
\w   Word character       Match any character that can be part of a word. It is similar to "[a-zA-Z0-9_]" except that it also includes all characters with diacritic marks.
\W   Non-word character   Match any character that cannot be part of a word. It is similar to "[^a-zA-Z0-9_]" except that it also excludes all characters with diacritic marks.
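The shorthands can be exercised with a brief Python sketch. Python's re module is only a stand-in for the engine described here: its raw strings (r"...") play the role of the doubled backslash, and on text strings its \d and \w are Unicode-aware, so \d accepts more than "[0-9]". The sample strings are invented.

```python
import re

digit_runs = re.findall(r"\d+", "Room 101, floor 3")
digits_only = re.sub(r"\D", "", "+1 (555) 010-9999")   # drop every non-digit

fields = re.split(r"\s+", "alpha  beta\tgamma\ndelta")
chunks = re.findall(r"\S+", " a  bc ")

# "\u00e9" is the letter é: \w accepts it along with the underscore.
words = re.findall(r"\w+", "caf\u00e9 au_lait, 2 cups!")
pieces = re.split(r"\W+", "one, two;three")

print(digit_runs, digits_only, fields, chunks, words, pieces)
```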
* POSIX is short for Portable Operating System Interface, a standard for ensuring portability across operating systems.
** Actually, a POSIX bracket expression is what we call a character class, and POSIX uses the term "character class" for the metasequences inside a bracket expression. We'll stick with the standard regular expression nomenclature.